{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RUymE2l9GZfO"
      },
      "source": [
        "**Copyright 2019 The TensorFlow Hub Authors.**\n",
        "\n",
        "Licensed under the Apache License, Version 2.0 (the \"License\");"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "code",
        "colab": {},
        "colab_type": "code",
        "id": "JMyTNwSJGGWg"
      },
      "outputs": [],
      "source": [
        "# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.\n",
        "#\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "#     http://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License.\n",
        "# =============================================================================="
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "co7MV6sX7Xto"
      },
      "source": [
        "# Cross-Lingual Similarity and Semantic Search Engine with Multilingual Universal Sentence Encoder\n",
        "\n",
        "\n",
        "\u003ctable align=\"left\"\u003e\u003ctd\u003e\n",
        "  \u003ca target=\"_blank\"  href=\"https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\n",
        "  \u003c/a\u003e\n",
        "\u003c/td\u003e\u003ctd\u003e\n",
        "  \u003ca target=\"_blank\"  href=\"https://github.com/tensorflow/hub/blob/master/examples/colab/cross_lingual_similarity_with_tf_hub_multilingual_universal_encoder.ipynb\"\u003e\n",
        "    \u003cimg width=32px src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView source on GitHub\u003c/a\u003e\n",
        "\u003c/td\u003e\u003c/table\u003e\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "eAVQGidpL8v5"
      },
      "source": [
        "This notebook illustrates how to access the Multilingual Universal Sentence Encoder module and use it for sentence similarity across multiple languages. This module is an extension of the [original Universal Encoder module](https://tfhub.dev/google/universal-sentence-encoder/2).\n",
        "\n",
        "The notebook is divided as follows:\n",
        "\n",
        "*   The first section shows a visualization of sentences between pair of languages. This is a more academic exercise.\n",
        "*   In the second section, we show how to build a semantic search engine from a sample of a Wikipedia corpus in multiple languages."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "UvNRbHGarYeR"
      },
      "source": [
        "## Citation\n",
        "\n",
        "*Research papers that make use of the models explored in this colab should cite:*\n",
        "\n",
        "### [Multilingual universal sentence encoder for semantic retrieval](https://arxiv.org/abs/1907.04307)\n",
        "Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019.\n",
        " arXiv preprint arXiv:1907.04307"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pOTzp8O36CyQ"
      },
      "source": [
        "# Getting Started\n",
        "\n",
        "This section sets up the environment for access to the Multilingual Universal Sentence Encoder Module and also prepares a set of English sentences and their translations. In the following sections, the multilingual module will be used to compute similarity *across languages*."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "63Pd3nJnTl-i"
      },
      "source": [
        "**IMPORTANT**Note: Pleaseelect \"**Python 3**\" _and_ \"**GPU**\" in the ***Runtime-\u003eChange Runtime type*** dropdown menu above _before_ running this notebook for faster execution."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "lVjNK8shFKOC"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "#@title Setup Environment\n",
        "# Install the latest Tensorflow version.\n",
        "!pip3 install tensorflow_text\n",
        "!pip3 install --upgrade tensorflow-gpu\n",
        "!pip install tensorflow-hub\n",
        "!pip install bokeh\n",
        "!pip install simpleneighbors\n",
        "!pip install tqdm"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "MSeY-MUQo2Ha"
      },
      "outputs": [],
      "source": [
        "#@title Setup common imports and functions\n",
        "import bokeh\n",
        "import bokeh.models\n",
        "import bokeh.plotting\n",
        "import numpy as np\n",
        "import os\n",
        "import pandas as pd\n",
        "import tensorflow.compat.v2 as tf\n",
        "import tensorflow_hub as hub\n",
        "from tensorflow_text import SentencepieceTokenizer\n",
        "import sklearn.metrics.pairwise\n",
        "\n",
        "from simpleneighbors import SimpleNeighbors\n",
        "from tqdm import tqdm\n",
        "from tqdm import trange\n",
        "\n",
        "def visualize_similarity(embeddings_1, embeddings_2, labels_1, labels_2,\n",
        "                         plot_title,\n",
        "                         plot_width=1200, plot_height=600,\n",
        "                         xaxis_font_size='12pt', yaxis_font_size='12pt'):\n",
        "\n",
        "  assert len(embeddings_1) == len(labels_1)\n",
        "  assert len(embeddings_2) == len(labels_2)\n",
        "\n",
        "  # arccos based text similarity (Yang et al. 2019; Cer et al. 2019)\n",
        "  sim = 1 - np.arccos(\n",
        "      sklearn.metrics.pairwise.cosine_similarity(embeddings_1,\n",
        "                                                 embeddings_2))/np.pi\n",
        "\n",
        "  embeddings_1_col, embeddings_2_col, sim_col = [], [], []\n",
        "  for i in range(len(embeddings_1)):\n",
        "    for j in range(len(embeddings_2)):\n",
        "      embeddings_1_col.append(labels_1[i])\n",
        "      embeddings_2_col.append(labels_2[j])\n",
        "      sim_col.append(sim[i][j])\n",
        "  df = pd.DataFrame(zip(embeddings_1_col, embeddings_2_col, sim_col),\n",
        "                    columns=['embeddings_1', 'embeddings_2', 'sim'])\n",
        "\n",
        "  mapper = bokeh.models.LinearColorMapper(\n",
        "      palette=[*reversed(bokeh.palettes.YlOrRd[9])], low=df.sim.min(),\n",
        "      high=df.sim.max())\n",
        "\n",
        "  p = bokeh.plotting.figure(title=plot_title, x_range=labels_1,\n",
        "                            x_axis_location=\"above\",\n",
        "                            y_range=[*reversed(labels_2)],\n",
        "                            plot_width=plot_width, plot_height=plot_height,\n",
        "                            tools=\"save\",toolbar_location='below', tooltips=[\n",
        "                                ('pair', '@embeddings_1 ||| @embeddings_2'),\n",
        "                                ('sim', '@sim')])\n",
        "  p.rect(x=\"embeddings_1\", y=\"embeddings_2\", width=1, height=1, source=df,\n",
        "         fill_color={'field': 'sim', 'transform': mapper}, line_color=None)\n",
        "\n",
        "  p.title.text_font_size = '12pt'\n",
        "  p.axis.axis_line_color = None\n",
        "  p.axis.major_tick_line_color = None\n",
        "  p.axis.major_label_standoff = 16\n",
        "  p.xaxis.major_label_text_font_size = xaxis_font_size\n",
        "  p.xaxis.major_label_orientation = 0.25 * np.pi\n",
        "  p.yaxis.major_label_text_font_size = yaxis_font_size\n",
        "  p.min_border_right = 300\n",
        "\n",
        "  bokeh.io.output_notebook()\n",
        "  bokeh.io.show(p)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gk2IRjZFGDsK"
      },
      "source": [
        "This is additional boilerplate code where we import the pre-trained ML model we will use to encode text throughout this notebook."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "mkmF3w8WGLcM"
      },
      "outputs": [],
      "source": [
        "# The 16-language multilingual module is the default but feel free\n",
        "# to pick others from the list and compare the results.\n",
        "module_url = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual/3' #@param ['https://tfhub.dev/google/universal-sentence-encoder-multilingual/3', 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3']\n",
        "\n",
        "model = hub.load(module_url)\n",
        "\n",
        "def embed_text(input):\n",
        "  return model(input)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "jhLPq6AROyFk"
      },
      "source": [
        "# Visualize Text Similarity Between Languages\n",
        "With the sentence embeddings now in hand, we can visualize semantic similarity across different languages."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8xdAogbxJDTD"
      },
      "source": [
        "## Computing Text Embeddings\n",
        "\n",
        "We first define a set of sentences translated to various languages in parallel. Then, we precompute the embeddings for all of our sentences."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Q8F4LNGFqOiq"
      },
      "outputs": [],
      "source": [
        "# Some texts of different lengths in different languages.\n",
        "arabic_sentences = ['كلب', 'الجراء لطيفة.', 'أستمتع بالمشي لمسافات طويلة على طول الشاطئ مع كلبي.']\n",
        "chinese_sentences = ['狗', '小狗很好。', '我喜欢和我的狗一起沿着海滩散步。']\n",
        "english_sentences = ['dog', 'Puppies are nice.', 'I enjoy taking long walks along the beach with my dog.']\n",
        "french_sentences = ['chien', 'Les chiots sont gentils.', 'J\\'aime faire de longues promenades sur la plage avec mon chien.']\n",
        "german_sentences = ['Hund', 'Welpen sind nett.', 'Ich genieße lange Spaziergänge am Strand entlang mit meinem Hund.']\n",
        "italian_sentences = ['cane', 'I cuccioli sono carini.', 'Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.']\n",
        "japanese_sentences = ['犬', '子犬はいいです', '私は犬と一緒にビーチを散歩するのが好きです']\n",
        "korean_sentences = ['개', '강아지가 좋다.', '나는 나의 산책을 해변을 따라 길게 산책하는 것을 즐긴다.']\n",
        "russian_sentences = ['собака', 'Милые щенки.', 'Мне нравится подолгу гулять по пляжу со своей собакой.']\n",
        "spanish_sentences = ['perro', 'Los cachorros son agradables.', 'Disfruto de dar largos paseos por la playa con mi perro.']\n",
        "\n",
        "# Multilingual example\n",
        "multilingual_example = [\"Willkommen zu einfachen, aber\", \"verrassend krachtige\", \"multilingüe\", \"compréhension du langage naturel\", \"модели.\", \"大家是什么意思\" , \"보다 중요한\", \".اللغة التي يتحدثونها\"]\n",
        "multilingual_example_in_en =  [\"Welcome to simple yet\", \"surprisingly powerful\", \"multilingual\", \"natural language understanding\", \"models.\", \"What people mean\", \"matters more than\", \"the language they speak.\"]\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "weXZqLtTJY9b"
      },
      "outputs": [],
      "source": [
        "# Compute embeddings.\n",
        "ar_result = embed_text(arabic_sentences)\n",
        "en_result = embed_text(english_sentences)\n",
        "es_result = embed_text(spanish_sentences)\n",
        "de_result = embed_text(german_sentences)\n",
        "fr_result = embed_text(french_sentences)\n",
        "it_result = embed_text(italian_sentences)\n",
        "ja_result = embed_text(japanese_sentences)\n",
        "ko_result = embed_text(korean_sentences)\n",
        "ru_result = embed_text(russian_sentences)\n",
        "zh_result = embed_text(chinese_sentences)\n",
        "\n",
        "multilingual_result = embed_text(multilingual_example)\n",
        "multilingual_in_en_result = embed_text(multilingual_example_in_en)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_3zGWuF-GhUm"
      },
      "source": [
        "## Visualizing Similarity\n",
        "\n",
        "With text embeddings in hand, we can take their dot-product to visualize how similar sentences are between languages. A darker color indicates the embeddings are semantically similar."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WOEIJA0mh70g"
      },
      "source": [
        "### Multilingual Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "R2hbCMhmiDWR"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(multilingual_in_en_result, multilingual_result,\n",
        "                     multilingual_example_in_en, multilingual_example,  \"Multilingual Universal Sentence Encoder for Semantic Retrieval (Yang et al., 2019)\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "h3TEhllsq3ax"
      },
      "source": [
        "### English-Arabic Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Q9UDpStmq7Ii"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, ar_result, english_sentences, arabic_sentences, 'English-Arabic Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QF9z48HMp4WL"
      },
      "source": [
        "### Engish-Russian Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "QE68UejYp86z"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, ru_result, english_sentences, russian_sentences, 'English-Russian Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "BJkL6Az0QXNN"
      },
      "source": [
        "### English-Spanish Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "CH_BXVGhQ0GL"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, es_result, english_sentences, spanish_sentences, 'English-Spanish Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "imn28LCiQO7d"
      },
      "source": [
        "### English-Italian Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "X9uD3DirPIGd"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, it_result, english_sentences, italian_sentences, 'English-Italian Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "m6ySvEGbQaTM"
      },
      "source": [
        "### Italian-Spanish Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "irfwIeitQ7V6"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(it_result, es_result, italian_sentences, spanish_sentences, 'Italian-Spanish Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ueoRO8balwwr"
      },
      "source": [
        "### English-Chinese Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "xA7anofVlxL7"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, zh_result, english_sentences, chinese_sentences, 'English-Chinese Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8zV1BJc3mL3W"
      },
      "source": [
        "### English-Korean Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "iqWy1e1UmQeX"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(en_result, ko_result, english_sentences, korean_sentences, 'English-Korean Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dfTj-JaunFTv"
      },
      "source": [
        "### Chinese-Korean Similarity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "MndSgKGPnJuF"
      },
      "outputs": [],
      "source": [
        "visualize_similarity(zh_result, ko_result, chinese_sentences, korean_sentences, 'Chinese-Korean Similarity')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rRabHHQYQfLr"
      },
      "source": [
        "### And more...\n",
        "\n",
        "The above examples can be extended to any language pair from **English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish**. Happy coding!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "mxAFAJI9xsAU"
      },
      "source": [
        "# Creating a Multilingual Semantic-Similarity Search Engine\n",
        "\n",
        "Whereas in the previous example we visualized a handful of sentences, in this section we will build a semantic-search index of about 200,000 sentences from a Wikipedia Corpus. About half will be in English and the other half in Spanish to demonstrate the multilingual capabilities of the Universal Sentence Encoder.\n",
        "\n",
        "## Download Data to Index\n",
        "First, we will download news sentences in multiples languages from the [News Commentary Corpus](http://opus.nlpl.eu/News-Commentary-v11.php) [1].  Without loss of generality, this approach should also work for indexing the rest of the supported languages."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "587I9ye6yXEU"
      },
      "outputs": [],
      "source": [
        "corpus_metadata = [\n",
        "    ('ar', 'ar-en.txt.zip', 'News-Commentary.ar-en.ar', 'Arabic'),\n",
        "    ('zh', 'en-zh.txt.zip', 'News-Commentary.en-zh.zh', 'Chinese'),\n",
        "    ('en', 'en-es.txt.zip', 'News-Commentary.en-es.en', 'English'),\n",
        "    ('ru', 'en-ru.txt.zip', 'News-Commentary.en-ru.ru', 'Russian'),\n",
        "    ('es', 'en-es.txt.zip', 'News-Commentary.en-es.es', 'Spanish'),\n",
        "]\n",
        "\n",
        "language_to_sentences = {}\n",
        "language_to_news_path = {}\n",
        "for language_code, zip_file, news_file, language_name in corpus_metadata:\n",
        "  zip_path = tf.keras.utils.get_file(\n",
        "      fname=zip_file,\n",
        "      origin='http://opus.nlpl.eu/download.php?f=News-Commentary/v11/moses/' + zip_file,\n",
        "      extract=True)\n",
        "  news_path = os.path.join(os.path.dirname(zip_path), news_file)\n",
        "  language_to_sentences[language_code] = pd.read_csv(news_path, sep='\\t', header=None)[0]\n",
        "  language_to_news_path[language_code] = news_path\n",
        "\n",
        "  print('{:,} {} sentences'.format(len(language_to_sentences[language_code]), language_name))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "m3DIT9uT7Z34"
      },
      "source": [
        "## Using a pre-trained model to transform sentences into vectors\n",
        "\n",
        "We compute embeddings in _batches_ so that they fit in the GPU's RAM."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "yRoRT5qCEIYy"
      },
      "outputs": [],
      "source": [
        "# Takes about 3 minutes\n",
        "\n",
        "batch_size = 2048\n",
        "language_to_embeddings = {}\n",
        "for language_code, zip_file, news_file, language_name in corpus_metadata:\n",
        "  print('\\nComputing {} embeddings'.format(language_name))\n",
        "  with tqdm(total=len(language_to_sentences[language_code])) as pbar:\n",
        "    for batch in pd.read_csv(language_to_news_path[language_code], sep='\\t',header=None, chunksize=batch_size):\n",
        "      language_to_embeddings.setdefault(language_code, []).extend(embed_text(batch[0]))\n",
        "      pbar.update(len(batch))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "oeBqoE8e-scg"
      },
      "source": [
        "## Building an index of semantic vectors\n",
        "\n",
        "We use the [SimpleNeighbors](https://pypi.org/project/simpleneighbors/) library---which is a wrapper for the [Annoy](https://github.com/spotify/annoy) library---to efficiently look up results from the corpus."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "lv_SOduAF1oi"
      },
      "outputs": [],
      "source": [
        "%%time\n",
        "\n",
        "# Takes about 8 minutes\n",
        "\n",
        "num_index_trees = 40\n",
        "language_name_to_index = {}\n",
        "embedding_dimensions = len(list(language_to_embeddings.values())[0][0])\n",
        "for language_code, zip_file, news_file, language_name in corpus_metadata:\n",
        "  print('\\nAdding {} embeddings to index'.format(language_name))\n",
        "  index = SimpleNeighbors(embedding_dimensions, metric='dot')\n",
        "\n",
        "  for i in trange(len(language_to_sentences[language_code])):\n",
        "    index.add_one(language_to_sentences[language_code][i], language_to_embeddings[language_code][i])\n",
        "\n",
        "  print('Building {} index with {} trees...'.format(language_name, num_index_trees))\n",
        "  index.build(n=num_index_trees)\n",
        "  language_name_to_index[language_name] = index"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "0aqGwIuLGrtu"
      },
      "outputs": [],
      "source": [
        "%%time\n",
        "\n",
        "# Takes about 13 minutes\n",
        "\n",
        "num_index_trees = 60\n",
        "print('Computing mixed-language index')\n",
        "combined_index = SimpleNeighbors(embedding_dimensions, metric='dot')\n",
        "for language_code, zip_file, news_file, language_name in corpus_metadata:\n",
        "  print('Adding {} embeddings to mixed-language index'.format(language_name))\n",
        "  for i in trange(len(language_to_sentences[language_code])):\n",
        "    annotated_sentence = '({}) {}'.format(language_name, language_to_sentences[language_code][i])\n",
        "    combined_index.add_one(annotated_sentence, language_to_embeddings[language_code][i])\n",
        "\n",
        "print('Building mixed-language index with {} trees...'.format(num_index_trees))\n",
        "combined_index.build(n=num_index_trees)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "kg9cw0S2_ntQ"
      },
      "source": [
        "## Verify that the semantic-similarity search engine works\n",
        "\n",
        "In this section we will demonstrate:\n",
        "\n",
        "1.   Semantic-search capabilities: retrieving sentences from the corpus that are semantically similar to the given query.\n",
        "2.   Multilingual capabilities: doing so in multiple languages when they query language and index language match\n",
        "3.   Cross-lingual capabilities: issuing queries in a distinct language than the indexed corpus\n",
        "4.   Mixed-language corpus: all of the above on a single index containing entries from all languages\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Dxu66S8wJIG9"
      },
      "source": [
        "### Semantic-search crosss-lingual capabilities\n",
        "\n",
        "In this section we show how to retrieve sentences related to a set of sample English sentences. Things to try:\n",
        "\n",
        "*   Try a few different sample sentences\n",
        "*   Try changing the number of returned results (they are returned in order of similarity)\n",
        "*   Try cross-lingual capabilities by returning results in different languages (might want to use [Google Translate](http://translate.google.com) on some results to your native language for sanity check)\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "_EFSd65B_mq8"
      },
      "outputs": [],
      "source": [
        "sample_query = 'The stock market fell four points.'  #@param [\"Global warming\", \"Researchers made a surprising new discovery last week.\", \"The stock market fell four points.\", \"Lawmakers will vote on the proposal tomorrow.\"] {allow-input: true}\n",
        "index_language = 'English'  #@param [\"Arabic\", \"Chinese\", \"English\", \"French\", \"German\", \"Russian\", \"Spanish\"]\n",
        "num_results = 10  #@param {type:\"slider\", min:0, max:100, step:10}\n",
        "\n",
        "query_embedding = embed_text(sample_query)\n",
        "\n",
        "search_results = language_name_to_index[index_language].nearest(query_embedding, n=num_results)\n",
        "\n",
        "print('{} sentences similar to: \"{}\"\\n'.format(index_language, sample_query))\n",
        "search_results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ybgj9o7hKDZV"
      },
      "source": [
        "### Mixed-corpus capabilities\n",
        "\n",
        "We will now issue a query in English, but the results will come from the any of the indexed languages."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "MJeTzuj0KU41"
      },
      "outputs": [],
      "source": [
        "sample_query = 'The stock market fell four points.'  #@param [\"Global warming\", \"Researchers made a surprising new discovery last week.\", \"The stock market fell four points.\", \"Lawmakers will vote on the proposal tomorrow.\"] {allow-input: true}\n",
        "num_results = 40  #@param {type:\"slider\", min:0, max:100, step:10}\n",
        "query_embedding = session.run(embedded_text, feed_dict={text_input: [sample_query]})[0]\n",
        "\n",
        "search_results = combined_index.nearest(query_embedding, n=num_results)\n",
        "\n",
        "print('{} sentences similar to: \"{}\"\\n'.format(index_language, sample_query))\n",
        "search_results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "tqIRtHIL2jAw"
      },
      "source": [
        "Try your own queries:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "ZK5ID6XF2n8k"
      },
      "outputs": [],
      "source": [
        "query = 'The stock market fell four points.'  #@param {type:\"string\"}\n",
        "num_results = 30  #@param {type:\"slider\", min:0, max:100, step:10}\n",
        "query_embedding = session.run(embedded_text, feed_dict={text_input: [query]})[0]\n",
        "\n",
        "search_results = combined_index.nearest(query_embedding, n=num_results)\n",
        "\n",
        "print('{} sentences similar to: \"{}\"\\n'.format(index_language, query))\n",
        "search_results"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "IPPwnhUNdOfc"
      },
      "source": [
        "# Further topics\n",
        "\n",
        "## Multilingual\n",
        "\n",
        "Finally, we encourage you to try queries in any of the supported languages: **English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Thai and Turkish**.\n",
        "\n",
        "Also, even though we only indexed in a subset of the languages, you can also index content in any of the supported languages.\n",
        "\n",
        "\n",
        "## Model variations\n",
        "\n",
        "We offer variations of the Universal Encoder models optimized for various things like memory, latency and/or quality. Please feel free to experiment with them to find a suitable one.\n",
        "\n",
        "## Nearest neighbor libraries\n",
        "\n",
        "We used Annoy to efficiently look up nearest neighbors. See the [tradeoffs section](https://github.com/spotify/annoy/blob/master/README.rst#tradeoffs) to read about the number of trees (memory-dependent) and number of items to search (latency-dependent)---SimpleNeighbors only allows to control the number of trees, but refactoring the code to use Annoy directly should be simple, we just wanted to keep this code as simple as possible for the general user.\n",
        "\n",
        "If Annoy does not scale for your application, please also check out [FAISS](https://github.com/facebookresearch/faiss)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5yj9VcfnbS-q"
      },
      "source": [
        "*All the best building your multilingual semantic applications!*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "X4oOfvSOKnjS"
      },
      "source": [
        "[1] J. Tiedemann, 2012, [Parallel Data, Tools and Interfaces in OPUS](http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf). In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "collapsed_sections": [],
      "machine_shape": "hm",
      "name": "Cross-Lingual Similarity and Semantic Search Engine with TF-Hub Multilingual Universal Encoder",
      "provenance": [],
      "version": "0.3.2"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
